This notebook dives deeper into using CellAssign with
marker gene sets derived from the PanglaoDB
database. Previously we had compared using marker gene sets from a
single organ or tissue type using two different databases,
PanglaoDB or CellMarker2.0. We liked the
organization of PanglaoDB, other than the missing cell
ontology IDs, so want to see how CellAssign performs when
using marker gene sets obtained by combining cell types across
organs.
Here we run CellAssign using two different gene sets and
compare the results to the previous CellAssign results
using only cell types found in the muscle/connective tissue and to the
submitter annotations.
combined refers to a reference gene set that contains
all cell types found in muscle, connective tissue, and the immune
system.combined_custom refers to a reference gene set that
contains all cell types found in muscle and connective tissue and only
Monocytes from the immune system. This is because the only overlap
between the submitter annotations and the cells in the immune system
found in PanglaoDB are Monocytes.muscle_only refers to a reference gene set that
contains only cell types found in muscle and connective tissue. The
predictions from using this reference gene set were obtained by running
CellAssign in the cell-assign-sarcoma.Rmd
notebook.library(SingleCellExperiment)
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins, colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps,
colMadDiffs, colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, colSums2, colTabulates,
colVarDiffs, colVars, colWeightedMads, colWeightedMeans, colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet, rowCollapse,
rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads, rowWeightedMeans,
rowWeightedMedians, rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit,
which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
library(ggplot2)
# source helper functions
function_file <- file.path("..", "utils", "cellassign-helper-functions.R")
source(function_file)
# Set up paths
# build path to annotated sce file
processed_data_dir <-file.path("..", "data")
processed_sce_file <- glue::glue("{params$library_id}_processed_celltype.rds")
local_processed_sce_path <- file.path(processed_data_dir, params$sample_id, processed_sce_file)
# anndata output
anndata_dir <- file.path("..", "data", "anndata")
anndata_file <- glue::glue("{params$library_id}_anndata.hdf5")
anndata_file <- file.path(anndata_dir, anndata_file)
# reference files for cell marker
ref_dir <- file.path("..", "references")
panglao_file <- file.path(ref_dir, "PanglaoDB_markers_27_Mar_2020.tsv")
# matrix files
panglao_comb_mtx_file <- file.path(ref_dir, "panglao_combined_mtx.csv")
panglao_custom_mtx_file <- file.path(ref_dir, "panglao_combined-custom_mtx.csv")
panglao_muscle_only_mtx_file <- file.path(ref_dir, "panglao_muscle_mtx.csv")
# predictions output
cellassign_outs_dir <- "cellassign_results"
# cellassign predictions
combined_prediction_file <- file.path(
cellassign_outs_dir,
glue::glue("{params$library_id}_panglao_combined.tsv")
)
muscle_only_prediction_file <- file.path(
cellassign_outs_dir,
glue::glue("{params$library_id}_panglao_muscle.tsv")
)
custom_prediction_file <- file.path(
cellassign_outs_dir,
glue::glue("{params$library_id}_panglao_combined-custom.tsv")
)
# if the file isn't present, grab it from aws
if(!file.exists(local_processed_sce_path)){
filename <- basename(local_processed_sce_path)
local_data_dir <- file.path("..", "data", params$sample_id)
s3_dir <- "s3://sc-data-integration/scpca/celltype_sce"
fs::dir_create(local_data_dir)
sync_call <- glue::glue('op run -- aws s3 sync {s3_dir} {local_data_dir} --exclude "*" --include "{filename}"')
system(sync_call, ignore.stdout = TRUE)
}
# read in annotated sce
processed_sce <- readr::read_rds(local_processed_sce_path)
if(!file.exists(anndata_file)){
# write out anndata
scpcaTools::sce_to_anndata(processed_sce, anndata_file = anndata_file)
}
# read in panglao db
panglao_df <- readr::read_tsv(panglao_file)
Rows: 8286 Columns: 14
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (8): species, official gene symbol, cell type, nicknames, product description, gene type, germ layer, organ
dbl (6): ubiquitousness index, canonical marker, sensitivity_human, sensitivity_mouse, specificity_human, specificity_mouse
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
muscle_cells <- c("Connective tissue", "Smooth muscle", "Skeletal muscle")
# grab all muscle related organs + all immune
panglao_tissue_immune_combined <- panglao_df |>
dplyr::filter(organ %in% c(muscle_cells, "Immune system"))
panglao_tissue_immune_combined |>
dplyr::count(`cell type`, `organ`)
# grab all muscle related organs + monocytes only
custom_marker_genes <- panglao_df |>
dplyr::filter(organ %in% muscle_cells |
`cell type` == "Monocytes")
custom_marker_genes |>
dplyr::count(`cell type`, `organ`)
# list of matrix files to create
mtx_files <- c(panglao_comb_mtx_file,
panglao_custom_mtx_file)
# if any of them don't exist, make them
if(!any(file.exists(mtx_files))){
# create rowdata df
rowdata_df <- rowData(processed_sce) |>
as.data.frame() |>
tibble::rownames_to_column("ensembl_id") |>
dplyr::select(ensembl_id, gene_symbol)
# create list of marker gene sets
all_marker_genes <- list(panglao_tissue_immune_combined,
custom_marker_genes)
names(all_marker_genes) <- c("combined", "combined_custom")
# get binary mtx for combined tissues
binary_matrix_list <- all_marker_genes |>
purrr::map(\(gene_list) build_binary_mtx(marker_genes_df = gene_list,
celltype_column = 'cell type',
gene_id_column = 'official gene symbol',
rowdata_df = rowdata_df))
# export mtx
purrr::walk2(binary_matrix_list, mtx_files,
\(binary_mtx, file)
readr::write_csv(binary_mtx, file))
}
# list of output prediction files
prediction_files <- c(combined_prediction_file,
custom_prediction_file)
# run cell assign with muscle ref
cellassign_calls <- purrr::map2(prediction_files, mtx_files,
\(predictions, marker_genes) {
glue::glue("python ../scripts/cell-assign.py \
--input_anndata '{anndata_file}' \
--output_predictions '{predictions}' \
--reference '{marker_genes}'")
})
system(cellassign_call)
# add muscle only to prediction file list
prediction_files <- c(prediction_files, muscle_only_prediction_file)
names(prediction_files) <- c("combined", "combined_custom", "muscle_only")
# read in prediction scores
all_predictions <- prediction_files |>
purrr::map(readr::read_tsv)
Rows: 5948 Columns: 41
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): barcode
dbl (40): Adipocyte progenitor cells, Adipocytes, Airway smooth muscle cells, B cells, B cells memory, B cells naive, Basophils, Chondrocytes, Dendritic cells, Eosinophils, Fibroblasts, Gamma del...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 5948 Columns: 17
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): barcode
dbl (16): Adipocyte progenitor cells, Adipocytes, Airway smooth muscle cells, Chondrocytes, Fibroblasts, Monocytes, Myoblasts, Myocytes, Myoepithelial cells, Myofibroblasts, Pulmonary vascular sm...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 5948 Columns: 16
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): barcode
dbl (15): Adipocyte progenitor cells, Adipocytes, Airway smooth muscle cells, Chondrocytes, Fibroblasts, Myoblasts, Myocytes, Myoepithelial cells, Myofibroblasts, Pulmonary vascular smooth muscle...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# get final cell type assignments
all_assignments <- purrr::map(all_predictions,
get_celltype_assignments)
# print table of assignments
all_assignments |>
purrr::map(\(celltype_assignments){
# print out table of assignments
celltype_assignments |>
dplyr::count(celltype) |>
dplyr::arrange(desc(n))
})
$combined
# A tibble: 29 × 2
celltype n
<chr> <int>
1 other 4946
2 Fibroblasts 227
3 Megakaryocytes 162
4 T memory cells 133
5 Dendritic cells 106
6 Monocytes 99
7 Myocytes 76
8 Airway smooth muscle cells 31
9 B cells 29
10 Plasma cells 19
# … with 19 more rows
# ℹ Use `print(n = ...)` to see more rows
$combined_custom
# A tibble: 14 × 2
celltype n
<chr> <int>
1 other 4914
2 Monocytes 535
3 Fibroblasts 237
4 Myocytes 107
5 Stromal cells 47
6 Airway smooth muscle cells 30
7 Vascular smooth muscle cells 19
8 Myoblasts 16
9 Satellite cells 12
10 Myoepithelial cells 11
11 Myofibroblasts 9
12 Smooth muscle cells 6
13 Adipocyte progenitor cells 3
14 Chondrocytes 2
$muscle_only
# A tibble: 14 × 2
celltype n
<chr> <int>
1 other 5261
2 Fibroblasts 284
3 Myocytes 161
4 Stromal cells 135
5 Airway smooth muscle cells 28
6 Satellite cells 17
7 Myoepithelial cells 16
8 Myofibroblasts 11
9 Vascular smooth muscle cells 9
10 Smooth muscle cells 8
11 Adipocyte progenitor cells 6
12 Chondrocytes 6
13 Myoblasts 5
14 Adipocytes 1
# heatmap of prediction scores
all_predictions |>
purrr::iwalk(\(prediction_mtx, organ_type){
prediction_mtx |>
dplyr::select(-barcode) |>
as.matrix() |>
pheatmap::pheatmap(show_rownames = FALSE,
main = organ_type)
})
The main conclusion after looking at this is that regardless of the marker gene set, most cells get classified as “other”
Here we directly compare the annotations between different marker gene sets and the submitter annotations.
# add celltypes to processed sce
processed_sce$combined_assignments <- all_assignments$combined$celltype
processed_sce$combined_custom_assignments <- all_assignments$combined_custom$celltype
processed_sce$muscle_only_assignments <- all_assignments$muscle_only$celltype
# extract coldata for easy plotting
coldata_df <- colData(processed_sce) |>
as.data.frame() |>
dplyr::select(barcode, celltype, combined_assignments, combined_custom_assignments, muscle_only_assignments) |>
# classify cells as tumor or normal
dplyr::mutate(cell_category = dplyr::if_else(stringr::str_detect(celltype, "Tumor"), "Tumor", celltype),
# remove sub states of myoblast/mesoderm/myocyte
broad_celltype = stringr::str_remove(celltype, "-\\w$"))
compare_refs_heatmap(original_assignment = coldata_df$combined_assignments,
cell_assign = coldata_df$muscle_only_assignments,
title = "Combined vs. Muscle only")
compare_refs_heatmap(original_assignment = coldata_df$combined_assignments,
cell_assign = coldata_df$broad_celltype,
title = "Combined vs. Submitter annotations")
compare_refs_heatmap(original_assignment = coldata_df$combined_custom_assignments,
cell_assign = coldata_df$muscle_only_assignments,
title = "Combined-custom vs. Muscle only")
compare_refs_heatmap(original_assignment = coldata_df$combined_custom_assignments,
cell_assign = coldata_df$broad_celltype,
title = "Combined-custom vs. Submitter annotations")
compare_refs_heatmap(original_assignment = coldata_df$muscle_only_assignments,
cell_assign = coldata_df$broad_celltype,
title = "Muscle only vs. Submitter annotations")
This section will compare the various cell type annotations to the cluster assignments. For simplicity, we will use louvain-jaccard graph-based clustering with the default nearest neighbors parameter of 10. We will look at UMAPs and heatmaps of the cluster assignments and cell type assignments. We expect that any assigned cell types should be somewhat correlated to the clusters, in particular for cell types that are less frequent. This means that a single cell type is probably less likely to be spread across multiple clusters.
# cluster sce object
pcs <- reducedDim(processed_sce, "PCA")
clusters <- bluster::clusterRows(pcs,
bluster::NNGraphParam(
cluster.fun = "louvain",
type = "jaccard"
))
top_n_labels <- 10
# pull out the cell type columns that we want to plot
umap_labels <- c("cluster_assignment",
"broad_celltype",
"combined_assignments",
"combined_custom_assignments",
"muscle_only_assignments")
umap_df <- coldata_df |>
# add clusters and UMAP embeddings to coldata
dplyr::mutate(
cluster_assignment = clusters,
UMAP_1 = reducedDim(processed_sce, "UMAP")[,1],
UMAP_2 = reducedDim(processed_sce, "UMAP")[,2]
) |>
# combine all the cell type assignments into one column
tidyr::pivot_longer(cols = all_of(umap_labels),
names_to = "label_type",
values_to = "cell_label") |>
# relabel other to distinguish from "other" added by forcats
# Cells categorized by other in CellAssign become Not assigned
dplyr::mutate(cell_label = dplyr::if_else(cell_label == "other",
"Not assigned",
cell_label)) |>
dplyr::group_split(label_type) |>
# only grab the top cell types for each cell type assignment method
# all other cell types will get grouped as "other"
purrr::map(\(df)
dplyr::mutate(df,
top_labels = forcats::fct_lump_n(
df$cell_label,
n = top_n_labels
))) |>
dplyr::bind_rows()
# create a separate umap for each of the cluster/ cell type assignments
# only plot the top cell types otherwise it gets overwhelming with colors
purrr::map(umap_labels,
\(label){
plotting_df <- umap_df |>
dplyr::filter(label_type %in% label)
ggplot(plotting_df, aes(x = UMAP_1, y = UMAP_2, color = top_labels)) +
geom_point(size = 0.2, alpha = 0.5) +
guides(color = guide_legend(override.aes = list(size = 3))) +
labs(color = label, title = label) +
theme_bw()
})
[[1]]
[[2]]
[[3]]
[[4]]
[[5]]
# heatmaps comparing clusters to cell type assignments
heatmap_labels <- umap_labels[umap_labels != "cluster_assignment"]
purrr::map(heatmap_labels,
\(label) compare_refs_heatmap(coldata_df[, label],
clusters,
cluster_cols = FALSE,
cluster_rows = TRUE,
title = glue::glue("{label} across clusters")))
[[1]]
[[2]]
[[3]]
[[4]]
It looks like the smaller immune cell populations (e.g., the T memory cells and Dendritic Cells in the combined annotation) are spread throughout the same cluster that contains cells that are “Not assigned”. These correlate with tumor cells in the submitter annotations. I would have maybe expected to see those cells in their own cluster or separate from the clusters containing tumor cells.
PanglaoDB most of the cells get categorized as “other” for
this sample. One thing to note is that when using SingleR
we are able to assign cells to muscle cells and
skeletal muscle cells. Although the assignments from
SingleR (see singler-rms-comparison.Rmd) don’t
completely line up with the submitter annotations, identifying cells as
muscle cells rather than labeling most of them as “other” does seem more
informative to me.CellAssign
takes. We already knew this, but this was a new record. It took ~ 2
hours to run the immune + connective tissue with 8 cpus.sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.5.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods base
other attached packages:
[1] ggplot2_3.3.6 SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1 Biobase_2.56.0 GenomicRanges_1.48.0 GenomeInfoDb_1.32.3
[7] IRanges_2.30.0 S4Vectors_0.34.0 BiocGenerics_0.42.0 MatrixGenerics_1.8.1 matrixStats_0.62.0
loaded via a namespace (and not attached):
[1] circlize_0.4.15 AnnotationHub_3.4.0 BiocFileCache_2.4.0 igraph_1.3.4 splines_4.2.1 BiocParallel_1.30.3
[7] scater_1.24.0 digest_0.6.29 foreach_1.5.2 htmltools_0.5.3 viridis_0.6.2 fansi_1.0.3
[13] magrittr_2.0.3 memoise_2.0.1 ScaledMatrix_1.4.0 cluster_2.1.3 doParallel_1.0.17 limma_3.52.2
[19] tzdb_0.3.0 ontologyPlot_1.6 openxlsx_4.2.5.2 miQC_1.4.0 ComplexHeatmap_2.12.1 Biostrings_2.64.0
[25] readr_2.1.2 vroom_1.5.7 colorspace_2.0-3 blob_1.2.3 rappdirs_0.3.3 ggrepel_0.9.1
[31] xfun_0.32 dplyr_1.0.9 jsonlite_1.8.0 crayon_1.5.1 RCurl_1.98-1.8 graph_1.74.0
[37] iterators_1.0.14 glue_1.6.2 gtable_0.3.0 zlibbioc_1.42.0 XVector_0.36.0 GetoptLong_1.0.5
[43] DelayedArray_0.22.0 BiocSingular_1.12.0 Rgraphviz_2.40.0 shape_1.4.6 scales_1.2.0 pheatmap_1.0.12
[49] edgeR_3.38.4 DBI_1.1.3 Rcpp_1.0.9 viridisLite_0.4.0 xtable_1.8-4 clue_0.3-64
[55] dqrng_0.3.0 bit_4.0.4 rsvd_1.0.5 DT_0.26 metapod_1.4.0 htmlwidgets_1.5.4
[61] httr_1.4.3 ontologyIndex_2.11 RColorBrewer_1.1-3 modeltools_0.2-23 ellipsis_0.3.2 farver_2.1.1
[67] flexmix_2.3-18 pkgconfig_2.0.3 scuttle_1.6.2 nnet_7.3-17 sass_0.4.2 dbplyr_2.2.1
[73] locfit_1.5-9.6 utf8_1.2.2 labeling_0.4.2 tidyselect_1.1.2 rlang_1.0.4 later_1.3.0
[79] AnnotationDbi_1.58.0 munsell_0.5.0 BiocVersion_3.15.2 tools_4.2.1 cachem_1.0.6 cli_3.3.0
[85] generics_0.1.3 RSQLite_2.2.15 evaluate_0.16 stringr_1.4.0 fastmap_1.1.0 yaml_2.3.5
[91] knitr_1.39 bit64_4.0.5 fs_1.5.2 zip_2.3.0 purrr_0.3.4 KEGGREST_1.36.3
[97] sparseMatrixStats_1.8.0 mime_0.12 scran_1.24.0 compiler_4.2.1 rstudioapi_0.13 beeswarm_0.4.0
[103] filelock_1.0.2 curl_4.3.2 png_0.1-7 interactiveDisplayBase_1.34.0 paintmap_1.0 statmod_1.4.37
[109] tibble_3.1.8 bslib_0.4.0 stringi_1.7.8 highr_0.9 scpcaTools_0.1.8 forcats_0.5.2
[115] lattice_0.20-45 bluster_1.6.0 Matrix_1.4-1 vctrs_0.4.1 pillar_1.8.0 lifecycle_1.0.1
[121] BiocManager_1.30.18 jquerylib_0.1.4 GlobalOptions_0.1.2 BiocNeighbors_1.14.0 bitops_1.0-7 irlba_2.3.5
[127] httpuv_1.6.5 R6_2.5.1 promises_1.2.0.1 renv_0.15.5 gridExtra_2.3 vipor_0.4.5
[133] ontoProc_1.18.0 codetools_0.2-18 assertthat_0.2.1 rjson_0.2.21 withr_2.5.0 GenomeInfoDbData_1.2.8
[139] parallel_4.2.1 hms_1.1.1 grid_4.2.1 beachmat_2.12.0 tidyr_1.2.0 rmarkdown_2.14
[145] DelayedMatrixStats_1.18.0 shiny_1.7.2 ggbeeswarm_0.6.0